We are now going to move into the realm of advanced RNN units. Up until this point, we have been dealing with the simple recurrent unit exclusively. However, implementing Rated units, Gated Recurrent Units, and Long Short-Term Memory will allow our models to become far more powerful.
A rated recurrent unit is simply a straightforward modification to the simple recurrent unit. Recall, a simple recurrent unit has the form:
Which can be observed on a lower level as:
And mathematically, we can define $h(t)$ as:
$$h(t) = f\big(W_x^T x(t) + W_h^T h(t-1)\big)$$
Where $f$ is our chosen activation function. The idea is that we we want to weight two things:
In order to accomplish this weighting, we use a matrix known as the rate matrix, $z$, which is the same size as the hidden layer. We then perform an element by element multiplication on each dimension:
$$h(t) = (1-z) \odot h(t-1) + z \odot f\big( x(t), h(t-1)\big)$$
$z$ is known as the rate. You may notice that this looks very similar to the low pass filter that we created in the previous post. The idea here to is to learn $z$ in a way that it let's the some values from previous points in time carry greater weight, and others depend mainly on the most recent values. Visually, the rated recurrent unit looks like:
The changes needed to implement this in code are relatively small and straightforward. We simply need to add a new param of size $M$, and include the above equation in the recurrence function.
There are more options for the rate matrix than simply what was discussed above. For example, we can make it dependent on the input and previous hidden state via a sigmoid and more weight matrices.
We can also make it a matrix (size $M x M$) so that it multiplies against $h$ with matrix multiplication, instead of element wise multiplication. In this case, you can picture what is happening as the same type of everything connected to everything scenario that we had for the hidden to hidden state.
The next thing that we are going to do is redo our poem generation example with the rated recurrent unit. It will be clear that this is just a small modification from the simple recurrent unit, so the code is almost exactly the same. Note, this new parameter will be trained in exactly the same way as all the other parameters-via gradient descent. In fact, when we look at more complex models the training will still remiain the same.
To add a bit more complexity to the mix, we are going to try and solve the problem of the lines ending too quickly. Recall that this is because we have many training samples that result in the next word being the end token. Because the end token shows up in every line, it is over represented in the training data. To prevent this, we will only train for going to the end of the sequence 10% of the time. Otherwise we stop on the second to last word.
Another change that we will make is that we will not model the initial distribution anymore. Recall, we use this to prevent every line from starting with the same word. This would happen because neural networks will give us the same prediction every time if we use the argmax. Instead, we will use the START token as the input, and the output will represent a probability distribution. This is valid because we already know that softmax gives us a valid probability distribution. We can then sample from this distribution to generate the next word. This way the sequences that we generate will be more stochastic, and it will treat the output probability as an actual probability, rather than a deterministic prediction:
p(w0) = softmax(f(START))
w0 = randint(V, p=p(w0))
import theano
import theano.tensor as T
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from rnn_util import init_weight, get_robert_frost
class SimpleRNN:
def __init__(self, D, M, V):
self.D = D # dimensionality of word embedding
self.M = M # hidden layer size
self.V = V # vocabulary size
def fit(self, X, learning_rate=10., mu=0.9, reg=0., activation=T.tanh, epochs=500, show_fig=False):
N = len(X)
D = self.D
M = self.M
V = self.V
# initial weights
We = init_weight(V, D)
Wx = init_weight(D, M)
Wh = init_weight(M, M)
bh = np.zeros(M)
h0 = np.zeros(M)
# z = np.ones(M)
Wxz = init_weight(D, M)
Whz = init_weight(M, M)
bz = np.zeros(M)
Wo = init_weight(M, V)
bo = np.zeros(V)
thX, thY, py_x, prediction = self.set(We, Wx, Wh, bh, h0, Wxz, Whz, bz, Wo, bo, activation)
lr = T.scalar('lr')
cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY]))
grads = T.grad(cost, self.params)
dparams = [theano.shared(p.get_value()*0) for p in self.params]
updates = []
for p, dp, g in zip(self.params, dparams, grads):
new_dp = mu*dp - lr*g
updates.append((dp, new_dp))
new_p = p + new_dp
updates.append((p, new_p))
self.predict_op = theano.function(inputs=[thX], outputs=prediction)
self.train_op = theano.function(
inputs=[thX, thY, lr],
outputs=[cost, prediction],
updates=updates
)
costs = []
for i in range(epochs):
X = shuffle(X)
n_correct = 0
n_total = 0
cost = 0
for j in range(N):
if np.random.random() < 0.1:
input_sequence = [0] + X[j]
output_sequence = X[j] + [1]
else:
input_sequence = [0] + X[j][:-1]
output_sequence = X[j]
n_total += len(output_sequence)
# we set 0 to start and 1 to end
c, p = self.train_op(input_sequence, output_sequence, learning_rate)
# print "p:", p
cost += c
# print "j:", j, "c:", c/len(X[j]+1)
for pj, xj in zip(p, output_sequence):
if pj == xj:
n_correct += 1
if i % 50 == 0:
print("i:", i, "cost:", cost, "correct rate:", (float(n_correct)/n_total))
if (i + 1) % 500 == 0:
learning_rate /= 2
costs.append(cost)
if show_fig:
plt.plot(costs)
plt.show()
def save(self, filename):
np.savez(filename, *[p.get_value() for p in self.params])
@staticmethod
def load(filename, activation):
# TODO: would prefer to save activation to file too
npz = np.load(filename)
We = npz['arr_0']
Wx = npz['arr_1']
Wh = npz['arr_2']
bh = npz['arr_3']
h0 = npz['arr_4']
Wxz = npz['arr_5']
Whz = npz['arr_6']
bz = npz['arr_7']
Wo = npz['arr_8']
bo = npz['arr_9']
V, D = We.shape
_, M = Wx.shape
rnn = SimpleRNN(D, M, V)
rnn.set(We, Wx, Wh, bh, h0, Wxz, Whz, bz, Wo, bo, activation)
return rnn
def set(self, We, Wx, Wh, bh, h0, Wxz, Whz, bz, Wo, bo, activation):
self.f = activation
# redundant - see how you can improve it
self.We = theano.shared(We)
self.Wx = theano.shared(Wx)
self.Wh = theano.shared(Wh)
self.bh = theano.shared(bh)
self.h0 = theano.shared(h0)
self.Wxz = theano.shared(Wxz)
self.Whz = theano.shared(Whz)
self.bz = theano.shared(bz)
self.Wo = theano.shared(Wo)
self.bo = theano.shared(bo)
self.params = [self.We, self.Wx, self.Wh, self.bh, self.h0, self.Wxz, self.Whz, self.bz, self.Wo, self.bo]
thX = T.ivector('X')
Ei = self.We[thX] # will be a TxD matrix
thY = T.ivector('Y')
def recurrence(x_t, h_t1):
# returns h(t), y(t)
hhat_t = self.f(x_t.dot(self.Wx) + h_t1.dot(self.Wh) + self.bh)
z_t = T.nnet.sigmoid(x_t.dot(self.Wxz) + h_t1.dot(self.Whz) + self.bz)
h_t = (1 - z_t) * h_t1 + z_t * hhat_t
y_t = T.nnet.softmax(h_t.dot(self.Wo) + self.bo)
return h_t, y_t
[h, y], _ = theano.scan(
fn=recurrence,
outputs_info=[self.h0, None],
sequences=Ei,
n_steps=Ei.shape[0],
)
py_x = y[:, 0, :]
prediction = T.argmax(py_x, axis=1)
self.predict_op = theano.function(
inputs=[thX],
outputs=[py_x, prediction],
allow_input_downcast=True,
)
return thX, thY, py_x, prediction
def generate(self, word2idx):
# convert word2idx -> idx2word
idx2word = {v:k for k,v in word2idx.items()}
V = len(word2idx)
n_lines = 0
X = [ 0 ]
while n_lines < 4:
# print "X:", X
PY_X, _ = self.predict_op(X)
PY_X = PY_X[-1].flatten()
P = [ np.random.choice(V, p=PY_X)]
X = np.concatenate([X, P]) # append to the sequence
P = P[-1] # just grab the most recent prediction
if P > 1:
# it's a real word, not start/end token
word = idx2word[P]
print(word, end=" ")
elif P == 1:
# end token
n_lines += 1
X = [0]
print('')
def train_poetry():
sentences, word2idx = get_robert_frost()
rnn = SimpleRNN(50, 50, len(word2idx))
rnn.fit(sentences, learning_rate=1e-4, show_fig=True, activation=T.nnet.relu, epochs=2000)
rnn.save('RRNN_D50_M50_epochs2000_relu.npz')
def generate_poetry():
sentences, word2idx = get_robert_frost()
rnn = SimpleRNN.load('RRNN_D50_M50_epochs2000_relu.npz', T.nnet.relu)
rnn.generate(word2idx)
if __name__ == '__main__':
# train_poetry()
generate_poetry()
We are now going to introduce a unit that is more powerful than the ellman unit, the Gated Recurrent Unit, (GRU). GRU's were introduced in 2014, while LSTM's (which we will be going over next) were introduced in 1997. I have chosen to start with GRU's because they are slightly less complex, and thus a better place to begin. The GRU is very similar to LSTM, incorporating many of the same concepts, but having far fewer parameters, meaning it can train faster at a constant hidden layer size.
To start, let's go over the architecture of the GRU. Before we do that, however, I want to quickly recap the previous architectures that we have seen, particularly treating the hidden unit as a black box of sorts. This compartmental point of view will help us in extending the simple unit and rated unit to the GRU.
Feedforward Unit
In the simplest feedforward network, this black box simply contains some nonlinear function like $tanh$ or $relu$:
Simple Recurrent Unit
In a simple recurrent network, we just connect the output of the black box back to itself, with a time delay of one:
Rated Recurrent Unit
With the rated recurrent unit, we add a rating operation between what would have been the output of the simple recurrent network, and the previous output value.
We can think of of this new operation as a gate. Since it has to take on a value between 0 and 1, and the other gate has to take on the 1 minus that value, it is a gate that is choosing between two things: taking on the old value or taking on the new value. The result here is that we get a mixture of both.
Gated Recurrent Unit
Finally, we arrive at the gated recurrent unit! The architecure simply requires that we add one more gate:
The above diagram corresponds with the following mathematical equations:
$$r_t = \sigma \big(x_t W_{xr} + h_{t-1}W_{hr} + b_r \big)$$
$$z_t = \sigma \big(x_t W_{xz} + h_{t-1}W_{hz} + b_z\big)$$
$$\hat{h}_t = g \big(x_t W_{xh} + (r_t \odot h_{t-1})W_{hh} + b_h\big)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t$$
Note: $g$ represents an activation, and $\odot$ is the symbol for element wise multiplication. With that said, we are simply seeing more of the same thing that we have been encountering thus far; weight matrices multiplied by inputs and passed through non linear functions and gates.
Again, we see the update gate $z_t$ that we encountered with the rated unit. It still balances how much of the previous hidden value and how much of the new candidate hidden value combines to get the new hidden value.
The new component is $r_t$, or the reset gate, which has the exact same functional form as the update gate- all of its weights are the same size. However, it is position in the black box is different. The reset gate is multiplied by the previous hidden state value. It controls how much of the previous hidden state we will consider, when we create the new candidate hidden value. In other words, it has the ability to reset the hidden value.
For instance, consider the situation where $r_t$ is equal to 0, then we get:
$$\hat{h}_t = g \big(x_t W_{xh} + b_h\big)$$
This would be as if $x_t$ were the beginning of a new sequence. Note that this is not the full picture, since $\hat{h}$ is only a candidate for the new $h_t$, since $h_t$ will be a combination of $\hat{h}$ and $h_{t-1}$ (which is controlled by the updated gate $z_t$).
Now, if that all sounds a bit complex, there is a practical way to reason with it. At the most general level, we are simply adding more parameters to our model, and making it more expressive. Adding more parameters allows us to fit more complex patterns.
When it comes to implement a GRU in code, it should be relatively simple if you already understand the simple recurrent unit and the rated recurrent unit. We simply need to add more weights, and modify the recurrence. The next step to making our code better is to modularize it. We have mentioned that the GRU can be looked at as a black box. To do this, we can simply make the GRU into a class so that it can be abstracted away. By doing this, we can stack GRU's, meaning anywhere that a hidden layer could go, a GRU can go as well. This allows us to just think of it as a thing that takes an input and produces an output. The fact that it contains a memory of previous inputs is just an internal detail of the black box.
Some pseudocode is show below:
class GRU:
def __init__(Mi, Mo):
Wxr = random(Mi, Mo)
# ...
def recurrence(x_t, h_t1):
r = sigmoid(x_t.dot(Wxr) + h_t1.dot(Whr) + br)
# ...
return (1 - z)*h_t1 + z*hhat
def output(x):
return scan(recurrence, x)
And, a full class implementation:
import numpy as np
import theano
import theano.tensor as T
from rnn_util import init_weight
class GRU:
def __init__(self, Mi, Mo, activation):
self.Mi = Mi
self.Mo = Mo
self.f = activation
Wxr = init_weight(Mi, Mo) # Input into reset gate
Whr = init_weight(Mo, Mo) # Hidden into reset gate
br = np.zeros(Mo) # Bias reset gate
Wxz = init_weight(Mi, Mo) # Input to update gate
Whz = init_weight(Mo, Mo) # Hidden to update gate
bz = np.zeros(Mo) # Bias update gate
Wxh = init_weight(Mi, Mo)
Whh = init_weight(Mo, Mo)
bh = np.zeros(Mo)
h0 = np.zeros(Mo)
# Create theano variables
self.Wxr = theano.shared(Wxr)
self.Whr = theano.shared(Whr)
self.br = theano.shared(br)
self.Wxz = theano.shared(Wxz)
self.Whz = theano.shared(Whz)
self.bz = theano.shared(bz)
self.Wxh = theano.shared(Wxh)
self.Whh = theano.shared(Whh)
self.bh = theano.shared(bh)
self.h0 = theano.shared(h0)
self.params = [self.Wxr, self.Whr, self.br, self.Wxz, self.Whz, self.bz, self.Wxh, self.Whh, self.bh, self.h0]
def recurrence(self, x_t, h_t1):
r = T.nnet.sigmoid(x_t.dot(self.Wxr) + h_t1.dot(self.Whr) + self.br) # Reset gate
z = T.nnet.sigmoid(x_t.dot(self.Wxz) + h_t1.dot(self.Whz) + self.bz) # Update gate
hhat = self.f(x_t.dot(self.Wxh) +(r*h_t1).dot(self.Whh) + self.bh) # Candidate for h
h = (1 - z)*h_t1 + z*hhat
return h
def output(self, x):
# Output for this unit taking in an input sequence X. X is 2 dimensional: Time x Dimension
h, _ = theano.scan(
fn=self.recurrence,
sequences=x,
outputs_info=[self.h0],
n_steps=x.shape[0]
)
return h
We are now going to move on from the GRU and start working with the LSTM. LSTM stands for Long Short-Term Memory, and it is a type of recurrent unit that has become very popular in recent years due to its superior performance/ability to overcome the vanishing gradient problem. Each recurrent unit that we have dug into has gotten progressively more complex, and each incorporated concepts from the previous unit.
The LSTM is the most complex recurrent unit we will cover. However, that complexity does not involve abstract ideas or new mathematics; rather, the number of components that we are dealing with is higher. Essentially, we will now have three different gates, called the input, output, and forget gate. Additionally, we are going to add yet another internal unit that will exist alongside the hidden state, sometimes referred to as a memory cell. One way to think of this cell is that it takes the place of $\hat{h}$, or the candidate hidden value, when we talked about the GRU.
Now, the input gate, $i_t$, and forget gate, $f_t$, should remind you of the rate gate, $z_t$, from the GRU. Before, we were using $z_t$ and $1- z_t$; however, now we just have two seperate gates:
Input Gate
The input gate just controls how much of the new value goes into the cell:
$$i_t = \sigma\big(x_t W_{xi} + h_{t-1}W_{hi} + c_{t-1}W_{ci} + b_i\big)$$
Forget Gate
The forget gate controls how much of the previous cell value goes into the current cell value:
$$f_t = \sigma\big( x_t W_{xf} + h_{t-1}W_{hf} + c_{t-1}W_{cf} + b_f\big)$$
Candidate
The candidate for the new cell value looks a lot like what would be the simple recurrent unit's value, right before it gets multiplied by the input gate:
$$c_t = f_tc_{t-1} + i_t \;tanh\big(x_t W_{xc} + h_{t-1}W_{hc} + b_c\big)$$
Output Gate
Finally, the output gate takes into account everything: the input at time $t$, the previous hidden state, and the current cell value:
$$o_t = \sigma \big( x_t W_{xo} + h_{t-1}W_{ho} + c_t W_{co} + b_o\big) $$
New Hidden State
The new hidden state is just the $tanh$ of the cell value, multiplied by the output gate:
$$h_t = o_t \; tanh(c_t)$$
Now, we just introduced a substantial number of new parameters, so it may be useful to try and break down exactly what they are.
| Model Component | Parameters | Depends on |
|---|---|---|
| Input Gate | $W_{xi}$, $W_{hi}$, $W_{ci}$, $b_i$ | $x_t$, $h_{t-1}$, $c_{t-1}$ |
| Forget Gate | $W_{xf}$, $W_{hf}$, $W_{cf}$, $b_f$ | $x_t$, $h_{t-1}$, $c_{t-1}$ |
| Candidate Cell | $W_{xc}$, $W_{hc}$, $b_c$ | $x_t$, $h_{t-1}$ |
| Output Gate | $W_{xo}$, $W_{ho}$, $W_{co}$, $b_o$ | $x_t$, $h_{t-1}$, $c_{t}$ |
In total, we have 15 new weights and biases for this black box! We have come a long way from our original feedforward net, where we only had 1 weight and 1 bias. Another way that we can think of this is that as we add more parameters, we enable our model to be more expressive.
Now, from an architecture perspective, an LSTM has the high level form of:
Where $\hat{c}$ is just meant to represent $x_t W_{xc} + h_{t-1}W_{hc} + b_c$ in the candidate equation. Now, for some this diagram may be sufficient, however, I would prefer to have a bit more insight to the inner working of the LSTM at a lower level. This can be seen below:
And we can highlight the actual cell as follows:
I encourage you to take the equations the we defined earlier and follow them through the architecture diagram above to make sure they make sense. I should point out that in order to reduce visual noise in the diagram I didn't inclue the weight matrices or biases.
We can now implement an LSTM in code:
import numpy as np
import theano
import theano.tensor as T
from rnn_util import init_weight
class LSTM:
def __init__(self, Mi, Mo, activation):
self.Mi = Mi
self.Mo = Mo
self.f = activation
Wxi = init_weight(Mi, Mo)
Whi = init_weight(Mo, Mo)
Wci = init_weight(Mo, Mo)
bi = np.zeros(Mo)
Wxf = init_weight(Mi, Mo)
Whf = init_weight(Mo, Mo)
Wcf = init_weight(Mo, Mo)
bf = np.zeros(Mo)
Wxc = init_weight(Mi, Mo)
Whc = init_weight(Mo, Mo)
bc = np.zeros(Mo)
Wxo = init_weight(Mi, Mo)
Who = init_weight(Mo, Mo)
Wco = init_weight(Mo, Mo)
bo = np.zeros(Mo)
c0 = np.zeros(Mo)
h0 = np.zeros(Mo)
# Create Theano variables
self.Wxi = theano.shared(Wxi)
self.Whi = theano.shared(Whi)
self.Wci = theano.shared(Wci)
self.bi = theano.shared(bi)
self.Wxf = theano.shared(Wxf)
self.Whf = theano.shared(Whf)
self.Wcf = theano.shared(Wcf)
self.bf = theano.shared(bf)
self.Wxc = theano.shared(Wxc)
self.Whc = theano.shared(Whc)
self.bc = theano.shared(bc)
self.Wxo = theano.shared(Wxo)
self.Who = theano.shared(Who)
self.Wco = theano.shared(Wco)
self.bo = theano.shared(bo)
self.c0 = theano.shared(c0)
self.h0 = theano.shared(h0)
self.params = [
self.Wxi,
self.Whi,
self.Wci,
self.bi,
self.Wxf,
self.Whf,
self.Wcf,
self.bf,
self.Wxc,
self.Whc,
self.bc,
self.Wxo,
self.Who,
self.Wco,
self.bo,
self.c0,
self.h0,
]
def recurrence(self, x_t, h_t1, c_t1):
i_t = T.nnet.sigmoid(x_t.dot(self.Wxi) + h_t1.dot(self.Whi) + c_t1.dot(self.Wci) + self.bi)
f_t = T.nnet.sigmoid(x_t.dot(self.Wxf) + h_t1.dot(self.Whf) + c_t1.dot(self.Wcf) + self.bf)
c_t = f_t*c_t1 + i_t*T.tanh(x_t.dot(self.Wxc) + h_t1.dot(self.Whc) + self.bc)
o_t = T.nnet.sigmoid(x_t.dot(self.Wxo) + h_t1.dot(self.Who) + c_t.dot(self.Wco) + self.bo)
h_t = o_t*T.tanh(c_t)
return h_t, c_t
def output(self, x):
# input X should be a matrix (2-D)
# rows index time
[h, c], _ = theano.scan(
fn=self.recurrence,
inputs=x,
outputs_info=[self.h0, self.c0], # Include c so that recurrence has access to it!
n_steps=x.shape[0]
)
return h
We are now going to take our RNN/LSTM and utilize it on a set of data from wikipedia. Our goal is going to be to create a language model just as we had done for poetry. Both simply hold sequences of words, but wikipedia has a much larger vocabulary and total size (13 GB). We are going to see if our learned word embeddings end up producing useful word analogies, and specifically visualize them on a scatter plot (after performing dimensionality reduction).
Overall, this section does not have a ton of new information in terms of theory/RNNs, but it focus's more on a how it would be applied in a real scenario. As such, I want to take a minute to go over how we actually would get the data. It can be found here. Once you have arrived, you can decide whether you want to get the entire datate set (14.7 GB at this point), or if you would rather just deal with a specific set of pages articles. I am choosing to work with a set of pages articles (i.e. not the entire dataset), because it will require some careful thought to prevent crashing python by overloading it's memory.
Next, we are going to need to convert the above data into a text file format. To do this we will use the wp2txt library. We can install it by running the following:
sudo gem install wp2txt
And it can then be run on a single file via:
wp2txt -i <filename>
Finally, we can talk about how we are going to take our text files and get it into the right format for our neural network. To do this we will be using the function get_wikipedia_data:
def get_wikipedia_data(n_files, n_vocab, by_paragraph=False):
"""Converts Wikipedia txt files into correct format for Neural Network
This function takes in a number of files that is too large to fit into memory if all data is loaded
at once. 100 or less seems to be ideal. The vocabulary also needs to be limited, since it is a lot
larger than the poetry dataset. We are going to have ~500,000-1,000,000 words. Note that the output
target is the next word, so that is 1 million output classes, which is a lot of output classes.
This makes it hard to get good accuracy, and it will make our output weight very large. To remedy
this, the vocabulary size will be restricted to n_vocab. This is generally set to ~2000 most
common words.
Args:
n_files: Number of input files taken in
n_vocab: Vocabulary size
by_paragraph:
Returns:
sentences: list of lists containing sentences mapped to index
word2idx_small: word2index mapping reduced to size n_vocab
"""
wiki_relative_path = f'{relative_path}wikipedia/unzipped'
input_files = [f for f in os.listdir(wiki_relative_path) if f.startswith('enwiki') and f.endswith('txt')]
# Return Variables
sentences = []
word2idx = {'START': 0, 'END': 1}
idx2word = ['START', 'END']
current_idx = 2
word_idx_count = {0: float('inf'), 1: float('inf')}
print(input_files)
print(len(input_files))
if n_files is not None:
input_files = input_files[:n_files]
for f in input_files:
print('Reading: ', f )
for line in open(f'{wiki_relative_path}/{f}'):
line = line.strip()
# Don't count headers, structured data, lists, etc
if line and line[0] not in ('[', '*', '-', '|', '=', '{', '}'):
if by_paragraph:
sentence_lines = [line]
else:
sentence_lines = line.split()
for sentence in sentence_lines:
tokens = my_tokenizer(sentence)
for t in tokens:
if t not in word2idx:
word2idx[t] = current_idx
idx2word.append(t)
current_idx += 1
idx = word2idx[t]
word_idx_count[idx] = word_idx_count.get(idx, 0) + 1
sentences_by_idx = [word2idx[t] for t in tokens]
sentences.append(sentences_by_idx)
# Reduce vocabulary size to n_vocab
sorted_word_idx_count = sorted(word_idx_count.items(), key=operator.itemgetter(1), reverse=True)
word2idx_small = {}
new_idx = 0
idx_new_idx_map = {}
for idx, count in sorted_word_idx_count[:n_vocab]:
word = idx2word[idx]
print(word, count)
word2idx_small[word] = new_idx
idx_new_idx_map[idx] = new_idx
new_idx += 1
# Let 'unknown' be last token
word2idx_small['UNKNOWN'] = new_idx
unknown = new_idx
assert('START' in word2idx_small)
assert('END' in word2idx_small)
assert('king' in word2idx_small)
assert('queen' in word2idx_small)
assert('man' in word2idx_small)
assert('woman' in word2idx_small)
# Map old idx to new idx
sentences_small = []
for sentence in sentences:
if len(sentence) > 1:
new_sentence = [idx_new_idx_map[idx] if idx in idx_new_idx_map else unknown for idx in sentence]
sentences_small.append(new_sentence)
return sentences_small, word2idx_small
A few things to note about the above function:
Now, I should now that if we train on the entire dataset there is a good chance that we will run out of memory. To prevent this, we don't want to load all of our data into memory at the same time. One strategy would be to train on each separate text file, opening each within each epoch. We would have to, of course, keep a dictionary of the word2idx mapping handy, so that it remains consistent between each file. This would be rather slow of course.
Another option, would be to convert each sentence into a list of word index's before running the neural network, and then save the word2idx mapping to a file as well. That way our input data can be just a bunch of arrays of word index's. This would still need to be in multiple files since it would still take up too much RAM, but we would prevent needing to convert the data each time we open the file.
A final option would to use a DB like MySQL in order to store the arrays of word indexes. This way we could retrieve rows at random, without needing to store them in memory.
We will now implement this is code:
import sys
import theano
import theano.tensor as T
import numpy as np
import matplotlib.pyplot as plt
import json
from datetime import datetime
from sklearn.utils import shuffle
from gru import GRU
from lstm import LSTM
from rnn_util import init_weight, get_wikipedia_data
class RNN:
def __init__(self, D, hidden_layer_sizes, V):
"""RNN class constructor
Args:
hidden_layer_sizes: Number of units in each hidden layer
D: Embedding size
V: Vocabulary size
"""
self.hidden_layer_sizes = hidden_layer_sizes
self.D = D
self.V = V
def fit(self, X, learning_rate=1e-5, mu=0.99, epochs=10,
show_fig=True, activation=T.nnet.relu, RecurrentUnit=GRU, normalize=True):
"""Fit RNN class to data set via unsupervised learning (no Y labels)
Args:
X: Input samples (sentences)
learning_rate: learning rate
mu: momentum
epochs: Number of training iterations
show_fig: Show figure
activation: non linear activation function
RecurrentUnit: GRU or LSTM
normalize: Normalizes word embeddings
Returns:
"""
D = self.D
V = self.V
N = len(X) # Number training samples (number of sentences)
We = init_weight(V, D)
self.hidden_layers = []
Mi = D
for Mo in self.hidden_layer_sizes: # Create hidden layers
ru = RecurrentUnit(Mi, Mo, activation) # Create recurrent unit (from class GRU or LSTM)
self.hidden_layers.append(ru)
Mi = Mo
Wo = init_weight(Mi, V)
bo = np.zeros(V)
# Create theano variables
self.We = theano.shared(We)
self.Wo = theano.shared(Wo)
self.bo = theano.shared(bo)
self.params = [self.Wo, self.bo]
# For each recurrent unit in hidden_layers, append the params
for ru in self.hidden_layers:
self.params += ru.params
# Create theano inputs and outputs
thX = T.ivector('X') # Sequence of word indexes
thY = T.ivector('Y') # Sequence of word indexes, offset for thX by 1
Z = self.We[thX] # Input sequence
for ru in self.hidden_layers:
Z = ru.output(Z)
py_x = T.nnet.softmax(Z.dot(self.Wo) + self.bo)
prediction = T.argmax(py_x, axis=1)
# Let's return py_x too so we can draw a sample instead
self.predict_op = theano.function(
inputs=[thX],
outputs=[py_x, prediction],
allow_input_downcast=True,
)
# Create cost and do gradient descent
cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY]))
grads = T.grad(cost, self.params)
dparams = [theano.shared(p.get_value()*0) for p in self.params]
# Gradient descent for word embeddings
dWe = theano.shared(self.We.get_value()*0)
gWe = T.grad(cost, self.We)
dWe_update = mu*dWe - learning_rate*gWe
We_update = self.We + dWe_update
if normalize:
We_update /= We_update.norm(2)
updates = [
(p, p + mu*dp - learning_rate*g) for p, dp, g in zip(self.params, dparams, grads)
] + [
(dp, mu*dp - learning_rate*g) for dp, g in zip(dparams, grads)
] + [
(self.We, We_update), (dWe, dWe_update)
]
self.train_op = theano.function(
inputs=[thX, thY],
outputs=[cost, prediction],
updates=updates
)
# Main training loop
costs = []
for i in range(epochs):
t0 = datetime.now()
X = shuffle(X)
n_correct = 0
n_total = 0
cost = 0
for j in range(N): # one sentence at a time
# 1% of time, will include END token
if np.random.random() < 0.01 or len(X[j]) <= 1:
input_sequence = [0] + X[j]
output_sequence = X[j] + [1]
else:
input_sequence = [0] +X[j][:-1]
output_sequence = X[j]
n_total += len(output_sequence)
# test:
try:
# we set 0 to start and 1 to end
c, p = self.train_op(input_sequence, output_sequence)
except Exception as e:
PYX, pred = self.predict_op(input_sequence)
print("input_sequence len:", len(input_sequence))
print("PYX.shape:",PYX.shape)
print("pred.shape:", pred.shape)
raise e
cost += c
for pj, xj in zip(p, output_sequence):
if pj == xj:
n_correct += 1
if j % 200 == 0:
# Using sys here helps compact the output.
sys.stdout.write("j/N: %d/%d correct rate so far: %f\r" % (j, N, float(n_correct) / n_total))
sys.stdout.flush()
print("i:", i, "cost:", cost, "correct rate:", (float(n_correct) / n_total), "time for epoch:",
(datetime.now() - t0))
costs.append(cost)
if show_fig:
plt.plot(costs)
plt.show()
def train_wikipedia(we_file='word_embeddings.npy', w2i_file='wikipedia_word2idx.json', RecurrentUnit=GRU):
sentences, word2idx = get_wikipedia_data(n_files=20, n_vocab=2000)
print("finished retrieving data")
print("vocab size:", len(word2idx), "number of sentences:", len(sentences))
rnn = RNN(30, [30], len(word2idx))
rnn.fit(sentences, learning_rate=1e-5, epochs=10, show_fig=True, activation=T.nnet.relu)
np.save(we_file, rnn.We.get_value())
with open(w2i_file, 'w') as f:
json.dump(word2idx, f)
def find_analogies(w1, w2, w3, we_file='word_embeddings.npy', w2i_file='wikipedia_word2idx.json'):
We = np.load(we_file)
with open(w2i_file) as f:
word2idx = json.load(f)
king = We[word2idx[w1]]
man = We[word2idx[w2]]
woman = We[word2idx[w3]]
v0 = king - man + woman
def dist1(a, b):
return np.linalg.norm(a - b)
def dist2(a, b):
return 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
for dist, name in [(dist1, 'Euclidean'), (dist2, 'cosine')]:
min_dist = float('inf')
best_word = ''
for word, idx in word2idx.items():
if word not in (w1, w2, w3):
v1 = We[idx]
d = dist(v0, v1)
if d < min_dist:
min_dist = d
best_word = word
print("closest match by", name, "distance:", best_word)
print(w1, "-", w2, "=", best_word, "-", w3)
if __name__ == '__main__':
we = 'models/lstm_word_embeddings2.npy'
w2i = 'models/lstm_wikipedia_word2idx2.json'
# train_wikipedia(we, w2i, RecurrentUnit=LSTM)
find_analogies('king', 'man', 'woman', we, w2i)
find_analogies('france', 'paris', 'london', we, w2i)
find_analogies('france', 'paris', 'rome', we, w2i)
find_analogies('paris', 'france', 'italy', we, w2i)
We can see that the embeddings that the RNN yielded do not seem to have the relationships we would like. However, if we utilize a dimensionality reduction technique, we may be able to improve upon this! Note, the above embeddings are based off of one iteration of training. If we had used 10, we would have improved significantly. On my local machine, training one iteration took 90 minutes.
Below, I will use the interactive plotting library, plotly, in junction with dimensionality reduction via TSNE to plot our word embeddings. We can then tinker with the plot to see what types of embeddings our network learned.
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
# Using cufflinks in offline mode
import cufflinks
cufflinks.go_offline()
# Set the global theme for cufflinks
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)
import warnings
warnings.filterwarnings('ignore')
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
We = np.load('models/gru_nonorm_part1_word_embeddings.npy')
V, D = We.shape
with open('models/gru_nonorm_part1_wikipedia_word2idx.json') as f:
word2idx = json.load(f)
idx2word = {v:k for k,v in word2idx.items()}
model = TSNE()
Z = model.fit_transform(We)
x = Z[:,0]
y = Z[:,1]
text = [idx2word[i] for i in range(V)]
color = [i for i in range(V)]
trace = go.Scatter(
x=x,
y=y,
text=text,
mode='markers',
marker=dict(
size=12,
color=color,
colorscale='Viridis',
showscale=True
)
)
data = [trace]
layout=go.Layout(
title='Word Embeddings with Dimensionality Reduction',
hovermode='closest',
xaxis=dict(title='Principal Component 1'),
yaxis=dict(title='Principal Component 2')
)
fig = go.Figure(data=data, layout=layout)
# py.iplot(fig)
plotly.offline.plot(
fig,
include_plotlyjs=False,
output_type='div'
)